Every night, hundreds of thousands of tourists opt to pay for and stay in accommodations provided by strangers through the Airbnb website instead of booking traditional lodging such as hotels. Since its inception in 2008, Airbnb has offered an online platform where individuals can rent various types of properties, including rooms, apartments, houses, and occasionally more unique accommodations. Over the years, Airbnb has experienced rapid and extensive growth, making it possible for anyone to find and rent a place virtually anywhere in the world.
This report focuses on Paris, the capital of France, aiming to analyze general trends regarding the prices set by hosts in the city. Our analysis is structured around four main objectives. Firstly, we aim to identify the relationship between prices and apartment features, with a specific emphasis on understanding how various factors such as size, amenities, and location influence rental rates. Secondly, we will delve into the habits of Parisian hosts, seeking to determine the typical number of apartments each owner offers for rent, providing insights into the scale of their operations. Thirdly, we will adopt a geographical approach to assess the renting prices per city quarter, known as “arrondissements,” examining how different areas within Paris correlate with varying price ranges. Finally, we will longitudinally examine the visit frequency of the different quarters over time, providing insights into the popularity and demand dynamics of various neighborhoods in Paris among Airbnb users.
The exercise comprises the following folders and files:
- The app.R R script, which contains the Shiny web application, including both the server and the user interface.
- The data provided for the development of this exercise, stored in an .RData file named AirBnB.RData. This file contains data related to Airbnb listings in Paris.
For this exercise, the objective is to explore and analyze the Paris dataset by creating a Shiny application. The application should include the following functionalities:
Relationship between Prices and Apartment Features: Analyze the relationship between rental prices and various apartment features such as the number of bedrooms, bathrooms, beds, and capacity to accommodate guests. Visualize this relationship through interactive plots or charts.
Number of Apartments per Owner: Calculate and display the number of apartments owned by each host. This analysis provides insights into the distribution of listings among different property owners.
Renting Price per City Quarter (“Arrondissements”): Explore the renting prices across different city quarters (arrondissements) in Paris. Analyze the variation in prices and identify areas with higher or lower rental rates. Visualize this information using interactive maps or charts.
Visit Frequency of Different Quarters According to Time: Determine the frequency of visits to different city quarters over time. Analyze trends in visitor activity and identify popular quarters during specific periods. Visualize visit frequency using different plots.
The Shiny application should provide an intuitive and user-friendly interface for users to interact with the data and explore various insights related to Airbnb listings in Paris.
In this analysis, we considered several key features present in the dataset to gain insights into the Airbnb listings. The features investigated are as follows:
By analyzing these features, we aimed to uncover patterns, trends, and relationships within the dataset, providing valuable insights into the Airbnb market in the study area. The findings from this analysis can inform various stakeholders, including hosts, guests, and policymakers, in making informed decisions related to Airbnb accommodations.
library(DataExplorer)
library(skimr)
library(tidyr)
library(shiny)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggmap)
## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggpubr)
library(writexl)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggmap':
##
## wind
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
library(leaflet)
library(corrplot)
## corrplot 0.92 loaded
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(here)
## here() starts at D:/Archana DSTI/Big Data Processing with R
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
First, load the dataset:
org_data <- load("D:/Archana DSTI/Big Data Processing with R/AirBnB.Rdata")
org_data
## [1] "L" "R"
load() returns the names of the objects it restored: two data frames, named L and R.
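As a minimal sketch of what load() does, the example below saves two toy data frames to a temporary file and restores them into a separate environment. The toy columns, the temporary file, and the names airbnb_env and loaded_names are assumptions made for illustration; they are not part of the original exercise.

```r
# Toy stand-ins for the two objects stored in the real file (hypothetical data).
L <- data.frame(id = 1:3, price = c("$60.00", "$200.00", "$80.00"))
R <- data.frame(listing_id = c(1, 1, 2),
                date = c("2016-01-01", "2016-02-01", "2016-01-15"))

# Save both objects into a temporary .RData file, then remove them.
tmp_file <- tempfile(fileext = ".RData")
save(L, R, file = tmp_file)
rm(L, R)

# load() restores the objects and returns only their names.
airbnb_env <- new.env()
loaded_names <- load(tmp_file, envir = airbnb_env)
print(loaded_names)

# Inspect each restored object without opening a viewer.
for (obj_name in loaded_names) {
  obj <- get(obj_name, envir = airbnb_env)
  cat(obj_name, ":", class(obj), "with", nrow(obj), "rows and",
      ncol(obj), "columns\n")
}
```

Loading into a dedicated environment is optional; it simply avoids overwriting objects that already exist in the global workspace.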
Running View(L) and View(R) opens each data frame in a spreadsheet-style viewer directly within the R environment, which makes it easier to browse the data and understand what it contains.
View(L)
View(R)
We observe the following:
L will be utilized for analyzing features, while R will be employed to compute the visit frequency of different quarters over time.
Generate a summary of the dataset L using the skim() function
skim(L)
## Warning: There was 1 warning in `dplyr::summarize()`.
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## ℹ In group 0: .
## Caused by warning:
## ! There were 39 warnings in `dplyr::summarize()`.
## The first warning was:
## ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
## mangled_skimmers$funs)`.
## Caused by warning in `sorted_count()`:
## ! Variable contains value(s) of "" that have been converted to "empty".
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 38 remaining warnings.
| Name | L |
| Number of rows | 52725 |
| Number of columns | 95 |
| _______________________ | |
| Column type frequency: | |
| factor | 64 |
| logical | 2 |
| numeric | 29 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| listing_url | 0 | 1 | FALSE | 52725 | htt: 1, htt: 1, htt: 1, htt: 1 |
| last_scraped | 0 | 1 | FALSE | 2 | 201: 28982, 201: 23743 |
| name | 0 | 1 | FALSE | 50132 | Cha: 39, App: 25, Cos: 23, Stu: 23 |
| summary | 0 | 1 | FALSE | 49385 | emp: 2743, Mon: 45, Mon: 20, Mon: 17 |
| space | 1 | 1 | FALSE | 38156 | emp: 14253, The: 12, La : 8, The: 6 |
| description | 0 | 1 | FALSE | 52447 | Mon: 31, Mon: 15, Hel: 14, Mon: 12 |
| experiences_offered | 0 | 1 | FALSE | 1 | non: 52725 |
| neighborhood_overview | 1 | 1 | FALSE | 31248 | emp: 20496, Le : 36, The: 13, The: 12 |
| notes | 3 | 1 | FALSE | 15719 | emp: 35361, If : 79, Les: 69, Wit: 47 |
| transit | 1 | 1 | FALSE | 33294 | emp: 18546, Pub: 16, DIR: 12, Sub: 12 |
| access | 1 | 1 | FALSE | 24627 | emp: 24663, Log: 118, Tou: 64, The: 48 |
| interaction | 1 | 1 | FALSE | 23646 | emp: 26874, A f: 75, Non: 64, Nou: 52 |
| house_rules | 1 | 1 | FALSE | 27163 | emp: 22345, .: 126, Reg: 98, Dép: 70 |
| thumbnail_url | 0 | 1 | FALSE | 39257 | emp: 13465, htt: 2, htt: 2, htt: 2 |
| medium_url | 0 | 1 | FALSE | 39257 | emp: 13465, htt: 2, htt: 2, htt: 2 |
| picture_url | 0 | 1 | FALSE | 52719 | htt: 2, htt: 2, htt: 2, htt: 2 |
| xl_picture_url | 0 | 1 | FALSE | 39257 | emp: 13465, htt: 2, htt: 2, htt: 2 |
| host_url | 0 | 1 | FALSE | 44874 | htt: 155, htt: 139, htt: 91, htt: 80 |
| host_name | 0 | 1 | FALSE | 9344 | Mar: 583, Nic: 436, Pie: 418, Car: 388 |
| host_since | 0 | 1 | FALSE | 2306 | 201: 166, 201: 165, 201: 155, 201: 135 |
| host_location | 0 | 1 | FALSE | 1560 | Par: 40856, FR: 5463, US: 609, Par: 522 |
| host_about | 5 | 1 | FALSE | 23867 | emp: 21939, We : 155, Nou: 139, .: 124 |
| host_response_time | 0 | 1 | FALSE | 6 | wit: 15039, wit: 13926, N/A: 12517, wit: 10201 |
| host_response_rate | 0 | 1 | FALSE | 87 | 100: 26619, N/A: 12517, 90%: 2524, 80%: 1567 |
| host_acceptance_rate | 0 | 1 | FALSE | 96 | 100: 19680, N/A: 15591, 0%: 1377, 50%: 1292 |
| host_is_superhost | 0 | 1 | FALSE | 3 | f: 50513, t: 2166, emp: 46 |
| host_thumbnail_url | 0 | 1 | FALSE | 44652 | htt: 192, htt: 155, htt: 139, htt: 91 |
| host_picture_url | 0 | 1 | FALSE | 44652 | htt: 192, htt: 155, htt: 139, htt: 91 |
| host_neighbourhood | 0 | 1 | FALSE | 231 | emp: 6541, Mon: 2968, Rép: 2271, But: 2140 |
| host_verifications | 0 | 1 | FALSE | 136 | [’e: 19488, [’e: 14829, [’e: 4194, [’e: 4085 |
| host_has_profile_pic | 0 | 1 | FALSE | 3 | t: 52487, f: 192, emp: 46 |
| host_identity_verified | 0 | 1 | FALSE | 3 | t: 26949, f: 25730, emp: 46 |
| street | 0 | 1 | FALSE | 8531 | Par: 308, Bou: 209, Rue: 202, Rue: 202 |
| neighbourhood | 0 | 1 | FALSE | 64 | emp: 7457, Mon: 2878, Rép: 2315, But: 2174 |
| neighbourhood_cleansed | 0 | 1 | FALSE | 20 | But: 6025, Pop: 4883, Vau: 3878, Bat: 3603 |
| city | 0 | 1 | FALSE | 136 | Par: 50825, Par: 115, Par: 106, Par: 87 |
| state | 0 | 1 | FALSE | 53 | Île: 50841, IDF: 1355, Ile: 271, emp: 72 |
| zipcode | 0 | 1 | FALSE | 79 | 750: 5973, 750: 4825, 750: 3799, 750: 3511 |
| market | 0 | 1 | FALSE | 30 | Par: 49392, emp: 3275, Oth: 15, Dal: 5 |
| smart_location | 0 | 1 | FALSE | 137 | Par: 50824, Par: 115, Par: 106, Par: 87 |
| country_code | 0 | 1 | FALSE | 2 | FR: 52724, CH: 1 |
| country | 0 | 1 | FALSE | 2 | Fra: 52724, Swi: 1 |
| is_location_exact | 0 | 1 | FALSE | 2 | t: 45356, f: 7369 |
| property_type | 0 | 1 | FALSE | 20 | Apa: 50663, Lof: 567, Hou: 537, Bed: 394 |
| room_type | 0 | 1 | FALSE | 3 | Ent: 45177, Pri: 7001, Sha: 547 |
| bed_type | 0 | 1 | FALSE | 5 | Rea: 45993, Pul: 5066, Cou: 1182, Fut: 449 |
| amenities | 0 | 1 | FALSE | 37737 | {}: 552, {TV: 95, {In: 90, {In: 68 |
| price | 0 | 1 | FALSE | 498 | $60: 3055, $50: 3047, $70: 2787, $80: 2598 |
| weekly_price | 0 | 1 | FALSE | 1186 | emp: 30034, $50: 1378, $40: 1291, $45: 1083 |
| monthly_price | 0 | 1 | FALSE | 1473 | emp: 37531, $1,: 769, $1,: 694, $2,: 637 |
| security_deposit | 0 | 1 | FALSE | 304 | emp: 20321, $30: 5421, $50: 5179, $20: 5040 |
| cleaning_fee | 0 | 1 | FALSE | 157 | emp: 20122, $30: 4904, $20: 4879, $50: 3281 |
| extra_people | 0 | 1 | FALSE | 89 | $0.: 37324, $10: 4453, $20: 2653, $15: 2469 |
| calendar_updated | 0 | 1 | FALSE | 61 | tod: 7594, 2 w: 5237, a w: 4351, 3 w: 3499 |
| calendar_last_scraped | 0 | 1 | FALSE | 2 | 201: 30064, 201: 22661 |
| first_review | 0 | 1 | FALSE | 1946 | emp: 14508, 201: 212, 201: 193, 201: 186 |
| last_review | 0 | 1 | FALSE | 1046 | emp: 14509, 201: 1327, 201: 1202, 201: 1116 |
| requires_license | 0 | 1 | FALSE | 1 | f: 52725 |
| license | 0 | 1 | FALSE | 2 | emp: 52724, AJO: 1 |
| jurisdiction_names | 0 | 1 | FALSE | 2 | Par: 51726, emp: 999 |
| instant_bookable | 0 | 1 | FALSE | 2 | f: 44186, t: 8539 |
| cancellation_policy | 0 | 1 | FALSE | 5 | fle: 19244, str: 18427, mod: 15039, sup: 9 |
| require_guest_profile_picture | 0 | 1 | FALSE | 2 | f: 51816, t: 909 |
| require_guest_phone_verification | 0 | 1 | FALSE | 2 | f: 51014, t: 1711 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| neighbourhood_group_cleansed | 52725 | 0 | NaN | : |
| has_availability | 52725 | 0 | NaN | : |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 7.069608e+06 | 4180018.28 | 2.62300e+03 | 3.470301e+06 | 6.965852e+06 | 1.074006e+07 | 1.381956e+07 | ▇▆▇▅▇ |
| scrape_id | 0 | 1.00 | 2.016070e+13 | 0.00 | 2.01607e+13 | 2.016070e+13 | 2.016070e+13 | 2.016070e+13 | 2.016070e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 2.248560e+07 | 20345155.79 | 2.62600e+03 | 6.158190e+06 | 1.588541e+07 | 3.434872e+07 | 8.139705e+07 | ▇▃▂▁▁ |
| host_listings_count | 46 | 1.00 | 5.830000e+00 | 28.97 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.024000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 46 | 1.00 | 5.830000e+00 | 28.97 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.024000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 4.886000e+01 | 0.02 | 4.88100e+01 | 4.885000e+01 | 4.886000e+01 | 4.888000e+01 | 4.891000e+01 | ▁▅▇▇▃ |
| longitude | 0 | 1.00 | 2.340000e+00 | 0.03 | 2.22000e+00 | 2.320000e+00 | 2.350000e+00 | 2.370000e+00 | 2.470000e+00 | ▁▃▇▃▁ |
| accommodates | 0 | 1.00 | 3.050000e+00 | 1.46 | 1.00000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| bathrooms | 243 | 1.00 | 1.090000e+00 | 0.38 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 8.000000e+00 | ▇▁▁▁▁ |
| bedrooms | 193 | 1.00 | 1.060000e+00 | 0.79 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+01 | ▇▁▁▁▁ |
| beds | 80 | 1.00 | 1.680000e+00 | 1.05 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| square_feet | 50218 | 0.05 | 3.679900e+02 | 485.53 | 0.00000e+00 | 0.000000e+00 | 3.230000e+02 | 5.380000e+02 | 1.505900e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.350000e+00 | 0.92 | 0.00000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 3.130000e+00 | 8.01 | 1.00000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.000000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 1.252547e+05 | 16204416.39 | 1.00000e+00 | 6.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.165000e+01 | 11.26 | 0.00000e+00 | 0.000000e+00 | 8.000000e+00 | 2.300000e+01 | 3.000000e+01 | ▇▂▂▂▃ |
| availability_60 | 0 | 1.00 | 2.733000e+01 | 22.49 | 0.00000e+00 | 2.000000e+00 | 2.600000e+01 | 5.000000e+01 | 6.000000e+01 | ▇▃▃▂▆ |
| availability_90 | 0 | 1.00 | 4.118000e+01 | 33.56 | 0.00000e+00 | 6.000000e+00 | 3.700000e+01 | 7.500000e+01 | 9.000000e+01 | ▇▃▂▃▆ |
| availability_365 | 0 | 1.00 | 1.794600e+02 | 146.77 | 0.00000e+00 | 2.200000e+01 | 1.830000e+02 | 3.360000e+02 | 3.650000e+02 | ▇▂▂▂▇ |
| number_of_reviews | 0 | 1.00 | 1.259000e+01 | 25.21 | 0.00000e+00 | 0.000000e+00 | 3.000000e+00 | 1.300000e+01 | 3.920000e+02 | ▇▁▁▁▁ |
| review_scores_rating | 15454 | 0.71 | 9.101000e+01 | 8.82 | 2.00000e+01 | 8.700000e+01 | 9.300000e+01 | 9.700000e+01 | 1.000000e+02 | ▁▁▁▂▇ |
| review_scores_accuracy | 15575 | 0.70 | 9.410000e+00 | 0.87 | 2.00000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 15566 | 0.70 | 9.110000e+00 | 1.13 | 2.00000e+00 | 9.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▂▇ |
| review_scores_checkin | 15579 | 0.70 | 9.600000e+00 | 0.76 | 2.00000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▁▇ |
| review_scores_communication | 15543 | 0.71 | 9.650000e+00 | 0.74 | 2.00000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▁▇ |
| review_scores_location | 15560 | 0.70 | 9.440000e+00 | 0.82 | 2.00000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▁▇ |
| review_scores_value | 15559 | 0.70 | 9.180000e+00 | 0.91 | 2.00000e+00 | 9.000000e+00 | 9.000000e+00 | 1.000000e+01 | 1.000000e+01 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 4.090000e+00 | 14.23 | 1.00000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.550000e+02 | ▇▁▁▁▁ |
| reviews_per_month | 14508 | 0.72 | 1.340000e+00 | 1.39 | 1.00000e-02 | 3.600000e-01 | 9.000000e-01 | 1.870000e+00 | 1.429000e+01 | ▇▁▁▁▁ |
The skim() summary also gives us a glimpse of the missing values, the unique values, and so on.
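The skim() warning above notes that empty strings were converted to "empty", which means a column can report zero missing values while still holding many blank entries. A minimal base R sketch of counting both kinds of gaps follows; the toy data frame and its values are hypothetical.

```r
# Hypothetical toy data: "" stands for a blank description.
toy <- data.frame(
  summary = factor(c("Cosy studio", "", "Loft", "")),
  bedrooms = c(1, NA, 2, 1)
)

# Count true NA values per column.
na_counts <- colSums(is.na(toy))

# Count empty strings per column (only meaningful for text-like columns).
empty_counts <- sapply(toy, function(col) {
  sum(as.character(col) == "", na.rm = TRUE)
})

print(data.frame(n_missing = na_counts, n_empty = empty_counts))
```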
We begin by listing all the columns from L dataset.
colnames(L)
## [1] "id" "listing_url"
## [3] "scrape_id" "last_scraped"
## [5] "name" "summary"
## [7] "space" "description"
## [9] "experiences_offered" "neighborhood_overview"
## [11] "notes" "transit"
## [13] "access" "interaction"
## [15] "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url"
## [19] "xl_picture_url" "host_id"
## [21] "host_url" "host_name"
## [23] "host_since" "host_location"
## [25] "host_about" "host_response_time"
## [27] "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url"
## [31] "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count"
## [35] "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street"
## [39] "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city"
## [43] "state" "zipcode"
## [45] "market" "smart_location"
## [47] "country_code" "country"
## [49] "latitude" "longitude"
## [51] "is_location_exact" "property_type"
## [53] "room_type" "accommodates"
## [55] "bathrooms" "bedrooms"
## [57] "beds" "bed_type"
## [59] "amenities" "square_feet"
## [61] "price" "weekly_price"
## [63] "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included"
## [67] "extra_people" "minimum_nights"
## [69] "maximum_nights" "calendar_updated"
## [71] "has_availability" "availability_30"
## [73] "availability_60" "availability_90"
## [75] "availability_365" "calendar_last_scraped"
## [77] "number_of_reviews" "first_review"
## [79] "last_review" "review_scores_rating"
## [81] "review_scores_accuracy" "review_scores_cleanliness"
## [83] "review_scores_checkin" "review_scores_communication"
## [85] "review_scores_location" "review_scores_value"
## [87] "requires_license" "license"
## [89] "jurisdiction_names" "instant_bookable"
## [91] "cancellation_policy" "require_guest_profile_picture"
## [93] "require_guest_phone_verification" "calculated_host_listings_count"
## [95] "reviews_per_month"
In order to preserve the original dataset, we will create a new one, called New_data, that keeps only the relevant columns.
Using the select clause, a subset of the L
dataset is created to use only the variables (out of the 95) that will
be useful for the project:
New_data <- select(L,
  listing_id = id, Host_id = host_id, Host_name = host_name,
  bathrooms, bedrooms, beds, bed_type,
  Equipments = amenities, Property_type = property_type, Room_type = room_type,
  Nb_of_guests = accommodates, Price = price,
  guests_included, minimum_nights, maximum_nights,
  availability_over_one_year = availability_365,
  instant_bookable, cancellation_policy,
  city, Adresse = street, Neighbourhood = neighbourhood_cleansed, city_quarter = zipcode,
  latitude, longitude, security_deposit, transit, host_response_time,
  Superhost = host_is_superhost, Host_since = host_since,
  Listing_count = calculated_host_listings_count, Host_score = review_scores_rating,
  reviews_per_month, number_of_reviews, square_feet)
Retrieve the column names of the New_data dataframe
colnames(New_data)
## [1] "listing_id" "Host_id"
## [3] "Host_name" "bathrooms"
## [5] "bedrooms" "beds"
## [7] "bed_type" "Equipments"
## [9] "Property_type" "Room_type"
## [11] "Nb_of_guests" "Price"
## [13] "guests_included" "minimum_nights"
## [15] "maximum_nights" "availability_over_one_year"
## [17] "instant_bookable" "cancellation_policy"
## [19] "city" "Adresse"
## [21] "Neighbourhood" "city_quarter"
## [23] "latitude" "longitude"
## [25] "security_deposit" "transit"
## [27] "host_response_time" "Superhost"
## [29] "Host_since" "Listing_count"
## [31] "Host_score" "reviews_per_month"
## [33] "number_of_reviews" "square_feet"
Remove duplicate entries from the dataset, keeping a single row per listing_id:
New_data <- New_data %>% distinct(listing_id, .keep_all = TRUE)
To manipulate the numeric variables as numbers, we need to ensure that they are stored with the appropriate data type, especially the “Price” column.
For this particular column, the values are stored as text with a leading $ sign and, for large amounts, comma separators. Both would cause problems when manipulating the numbers, so they need to be removed:
# Removing the "," separators, then dropping the leading "$" character
New_data$Price <- substring(gsub(",", "", as.character(New_data$Price)),2)
Let’s take a glimpse at ‘Price’ column in the New_data dataframe to verify that the $ symbol is removed
glimpse(New_data[,"Price"])
## chr [1:52725] "60.00" "200.00" "80.00" "60.00" "50.00" "191.00" "100.00" ...
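The substring() call above drops the first character whatever it is, which works because every price starts with "$". As a hedged alternative, a pattern-based replacement removes only the "$" and "," characters, so a value without the prefix would survive intact. The vector raw_price below is made up for illustration.

```r
# Hypothetical price strings; the last one has no "$" prefix.
raw_price <- c("$60.00", "$1,250.00", "80.00")

# Position-based cleaning, as in the report: always drops character 1.
by_position <- substring(gsub(",", "", raw_price), 2)

# Pattern-based cleaning: removes only "$" and "," wherever they appear.
by_pattern <- gsub("[$,]", "", raw_price)

print(as.numeric(by_position))
print(as.numeric(by_pattern))
```

On this toy input the two approaches agree for the prefixed values and differ only for the unprefixed one, where the position-based version loses the leading digit.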
Let’s take a look into the data types in the ‘New_data’ dataset.
data_types <- data.frame(Column_Name = names(New_data), Data_Type = sapply(New_data, class))
print(data_types)
## Column_Name Data_Type
## listing_id listing_id integer
## Host_id Host_id integer
## Host_name Host_name factor
## bathrooms bathrooms numeric
## bedrooms bedrooms integer
## beds beds integer
## bed_type bed_type factor
## Equipments Equipments factor
## Property_type Property_type factor
## Room_type Room_type factor
## Nb_of_guests Nb_of_guests integer
## Price Price character
## guests_included guests_included integer
## minimum_nights minimum_nights integer
## maximum_nights maximum_nights integer
## availability_over_one_year availability_over_one_year integer
## instant_bookable instant_bookable factor
## cancellation_policy cancellation_policy factor
## city city factor
## Adresse Adresse factor
## Neighbourhood Neighbourhood factor
## city_quarter city_quarter factor
## latitude latitude numeric
## longitude longitude numeric
## security_deposit security_deposit factor
## transit transit factor
## host_response_time host_response_time factor
## Superhost Superhost factor
## Host_since Host_since factor
## Listing_count Listing_count integer
## Host_score Host_score integer
## reviews_per_month reviews_per_month numeric
## number_of_reviews number_of_reviews integer
## square_feet square_feet integer
To ensure that the variables have the appropriate data type, we apply the following data type conversions:
1. Converting to numeric columns:
# Changing the data type
New_data$bedrooms <- as.numeric(New_data$bedrooms)
New_data$beds <- as.numeric(New_data$beds)
New_data$Price <- as.numeric(New_data$Price)
New_data$guests_included <- as.numeric(New_data$guests_included)
New_data$minimum_nights <- as.numeric(New_data$minimum_nights)
New_data$maximum_nights <- as.numeric(New_data$maximum_nights)
New_data$availability_over_one_year <- as.numeric(New_data$availability_over_one_year)
# Caution: security_deposit is still a factor of "$"-prefixed text at this point,
# so as.numeric() returns its internal level codes rather than dollar amounts.
New_data$security_deposit <- as.numeric(New_data$security_deposit)
New_data$Listing_count <- as.numeric(New_data$Listing_count)
New_data$Host_score <- as.numeric(New_data$Host_score)
New_data$number_of_reviews <- as.numeric(New_data$number_of_reviews)
New_data$square_feet <- as.numeric(New_data$square_feet)
2. Converting to character columns:
New_data$Neighbourhood <- as.character(New_data$Neighbourhood)
3. Converting to date columns:
New_data$Host_since <- as.Date(New_data$Host_since)
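One pitfall with these conversions is worth illustrating: calling as.numeric() on a factor returns the integer codes of its levels, not the numbers printed in its labels. This matters for security_deposit, which the data type table lists as a factor. The sketch below uses a made-up deposit vector to show the difference.

```r
# Hypothetical deposits stored as a factor, including one blank entry.
deposit <- factor(c("$300.00", "", "$1,000.00", "$300.00"))

# Direct conversion returns the internal level codes, not the amounts.
level_codes <- as.numeric(deposit)
print(level_codes)

# Going through as.character() and stripping "$" and "," recovers the amounts;
# blank entries become NA instead of a misleading number.
deposit_text <- gsub("[$,]", "", as.character(deposit))
deposit_text[!nzchar(deposit_text)] <- NA_character_
deposit_amount <- as.numeric(deposit_text)
print(deposit_amount)
```

With this approach the blank deposits show up as missing values, so they would be counted in any later tally of NA values.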
Finally, let’s ensure the data types are updated.
data_types <- data.frame(Column_Name = names(New_data), Data_Type = sapply(New_data, class))
print(data_types)
## Column_Name Data_Type
## listing_id listing_id integer
## Host_id Host_id integer
## Host_name Host_name factor
## bathrooms bathrooms numeric
## bedrooms bedrooms numeric
## beds beds numeric
## bed_type bed_type factor
## Equipments Equipments factor
## Property_type Property_type factor
## Room_type Room_type factor
## Nb_of_guests Nb_of_guests integer
## Price Price numeric
## guests_included guests_included numeric
## minimum_nights minimum_nights numeric
## maximum_nights maximum_nights numeric
## availability_over_one_year availability_over_one_year numeric
## instant_bookable instant_bookable factor
## cancellation_policy cancellation_policy factor
## city city factor
## Adresse Adresse factor
## Neighbourhood Neighbourhood character
## city_quarter city_quarter factor
## latitude latitude numeric
## longitude longitude numeric
## security_deposit security_deposit numeric
## transit transit factor
## host_response_time host_response_time factor
## Superhost Superhost factor
## Host_since Host_since Date
## Listing_count Listing_count numeric
## Host_score Host_score numeric
## reviews_per_month reviews_per_month numeric
## number_of_reviews number_of_reviews numeric
## square_feet square_feet numeric
Removing Outliers
summary(New_data$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 55.00 75.00 96.51 110.00 6081.00
A quick computation using the R summary() function, as done above, shows that the minimum price is $0 and the maximum is $6081. Although human kindness is limitless, free rentals do not exist on Airbnb. It also seems unreasonable to spend $6081 to rent a property for one night. At the time of writing, a quick search for rentals in Paris on the Airbnb website showed prices ranging from around $20 to approximately $1300. Consequently, we use these values as the bounds for the Price variable and remove the listings that fall outside them.
# Setting the price range (bounds are inclusive)
New_data <- New_data %>%
  filter(Price >= 20, Price <= 1300)
summary(New_data$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 55.00 75.00 95.74 110.00 1285.00
After cleaning up the dataset, we can see that the median price is still $75, with a minimum of $20 and a maximum of $1285.
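Before discarding rows, it can be useful to check how many observations a pair of bounds would remove. The following base R sketch does this on a made-up price vector; the values and the variable names lower, upper and keep are assumptions for illustration only.

```r
# Hypothetical nightly prices, including implausible extremes.
prices <- c(0, 15, 20, 55, 75, 110, 1285, 1300, 6081)

# Inclusive bounds, matching the range chosen in the report.
lower <- 20
upper <- 1300

# Flag the values that fall inside the accepted range.
keep <- prices >= lower & prices <= upper

cat("Removed", sum(!keep), "of", length(prices), "values\n")
print(summary(prices[keep]))
```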
Quantifying missing values
As a first step, we assess the robustness and relevance of the selected data by counting the missing values present in each variable.
# Count missing values for all columns
missing_counts <- colSums(is.na(New_data))
# Print the results
missing_counts
## listing_id Host_id
## 0 0
## Host_name bathrooms
## 0 243
## bedrooms beds
## 193 80
## bed_type Equipments
## 0 0
## Property_type Room_type
## 0 0
## Nb_of_guests Price
## 0 0
## guests_included minimum_nights
## 0 0
## maximum_nights availability_over_one_year
## 0 0
## instant_bookable cancellation_policy
## 0 0
## city Adresse
## 0 0
## Neighbourhood city_quarter
## 0 0
## latitude longitude
## 0 0
## security_deposit transit
## 0 1
## host_response_time Superhost
## 0 0
## Host_since Listing_count
## 46 0
## Host_score reviews_per_month
## 15398 14458
## number_of_reviews square_feet
## 0 50111
Here we find that about 95% of the values in the square_feet variable are missing (50111 rows). A very small proportion of values in the bedrooms, bathrooms and beds columns are also missing. This observation prompts us to drop the square_feet variable from New_data without hesitation. In contrast, handling the missing values for bedrooms and bathrooms requires some discussion, since different approaches are possible. First, we could fill the missing values with the most representative value: as the variables ‘bedrooms’ and ‘bathrooms’ are numeric, the mean can be used for this. Another approach would be to simply remove these rows from the dataset; because they represent so few observations, this would not affect the overall dataset and analysis.
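The 95% figure can be obtained directly as a per-column share of missing values. Below is a minimal sketch on a toy data frame; the data and the 50% cut-off (max_missing_share) are hypothetical choices for illustration, not values taken from the report.

```r
# Hypothetical toy data with one mostly empty column.
toy <- data.frame(
  bedrooms = c(1, NA, 2, 1, 1),
  bathrooms = c(1, 1, NA, 1, 2),
  square_feet = c(NA, NA, NA, 323, NA)
)

# Share of missing values per column (between 0 and 1).
missing_share <- colMeans(is.na(toy))
print(round(100 * missing_share, 1))

# Drop every column whose missing share exceeds the chosen cut-off.
max_missing_share <- 0.5
toy_reduced <- toy[, missing_share <= max_missing_share, drop = FALSE]
print(names(toy_reduced))
```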
Fill in missing value
The approach followed in this case is to fill the missing values with the mean value of the corresponding column (bathrooms, bedrooms and beds), using na.aggregate() from the zoo package, whose default aggregation function is the mean:
#Bathrooms
# Calculate the mean value of the "bathrooms" column
mean_value <- mean(New_data$bathrooms, na.rm = TRUE)
# Replace missing values with the mean value
New_data$bathrooms <- na.aggregate(New_data$bathrooms)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled <- mean(New_data$bathrooms)
print(paste("Value with which missing values are filled:", mean_value_filled))
## [1] "Value with which missing values are filled: 1.08914898898287"
#Bedrooms
# Calculate the mean value of the "bedrooms" column
mean_value_bedrooms <- mean(New_data$bedrooms, na.rm = TRUE)
# Replace missing values with the mean value
New_data$bedrooms <- na.aggregate(New_data$bedrooms)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled_bedrooms <- mean(New_data$bedrooms)
print(paste("Value with which missing values are filled", mean_value_filled_bedrooms))
## [1] "Value with which missing values are filled 1.0583904011598"
#Beds
# Calculate the mean value of the "beds" column
mean_value_beds <- mean(New_data$beds, na.rm = TRUE)
# Replace missing values with the mean value
New_data$beds <- na.aggregate(New_data$beds)
# Calculate the mean value again to see the value with which missing values were filled
mean_value_filled_beds <- mean(New_data$beds)
print(paste("Value with which missing values are filled", mean_value_filled_beds))
## [1] "Value with which missing values are filled 1.68261763362266"
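The three blocks above repeat the same steps for each column. As a sketch of the same idea in base R, the helper below (impute_mean, a hypothetical name) fills missing values with the column mean and is applied to several columns at once. It also shows why the mean printed after filling equals the mean before filling: replacing gaps with the mean leaves the mean unchanged.

```r
# Hypothetical helper: replace NA values with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data with one gap per column.
toy <- data.frame(
  bathrooms = c(1, NA, 2, 1),
  bedrooms = c(1, 2, NA, 3),
  beds = c(1, 2, 2, NA)
)
cols <- c("bathrooms", "bedrooms", "beds")

# Column means computed on the observed values only.
means_before <- colMeans(toy[cols], na.rm = TRUE)

# Fill every listed column in one pass.
toy[cols] <- lapply(toy[cols], impute_mean)

# Column means after filling; no NA values remain.
means_after <- colMeans(toy[cols])
print(rbind(means_before, means_after))
```

Note that mean imputation produces fractional room counts; whether that is acceptable depends on how the columns are used later.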
In this analysis, we extracted distinct values from the “Neighbourhood” column of the New_data dataframe, checking for any spelling variations or inconsistencies across neighborhoods. The resulting list provides a comprehensive overview of unique neighborhood names within the dataset.
# Get distinct values from the "Neighbourhood" column in the New_data dataframe
distinct_neighbourhoods <- unique(New_data$Neighbourhood)
distinct_neighbourhoods_list <- as.list(distinct_neighbourhoods)
distinct_neighbourhoods_list
## [[1]]
## [1] "Batignolles-Monceau"
##
## [[2]]
## [1] "Palais-Bourbon"
##
## [[3]]
## [1] "Buttes-Chaumont"
##
## [[4]]
## [1] "Opéra"
##
## [[5]]
## [1] "Entrepôt"
##
## [[6]]
## [1] "Gobelins"
##
## [[7]]
## [1] "Vaugirard"
##
## [[8]]
## [1] "Reuilly"
##
## [[9]]
## [1] "Louvre"
##
## [[10]]
## [1] "Luxembourg"
##
## [[11]]
## [1] "Élysée"
##
## [[12]]
## [1] "Temple"
##
## [[13]]
## [1] "Ménilmontant"
##
## [[14]]
## [1] "Panthéon"
##
## [[15]]
## [1] "Passy"
##
## [[16]]
## [1] "Observatoire"
##
## [[17]]
## [1] "Popincourt"
##
## [[18]]
## [1] "Bourse"
##
## [[19]]
## [1] "Buttes-Montmartre"
##
## [[20]]
## [1] "Hôtel-de-Ville"
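Checking a list of names by eye works for twenty values, but the comparison can also be automated. The sketch below normalises case and surrounding whitespace, then reports any group of spellings that collapse to the same key. The input vector is made up and deliberately contains variants; the names key, variants and inconsistent are hypothetical.

```r
# Hypothetical names with case and whitespace variations.
neighbourhoods <- c("Passy", "passy ", "Louvre", "LOUVRE", "Temple")

# Normalised key: trimmed and lower-cased.
key <- tolower(trimws(neighbourhoods))

# Group the original spellings by key.
variants <- split(neighbourhoods, key)

# Keep only the keys that have more than one distinct spelling.
inconsistent <- variants[sapply(variants, function(v) length(unique(v)) > 1)]
print(inconsistent)
```

An empty result would indicate that no such variations exist, which is consistent with the twenty clean names listed above.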
City_quarter Column cleaning
# Cleaning the city quarters (arrondissements):
New_data$city <- str_sub(New_data$city, 1, 5)
New_data$city_quarter <- str_sub(New_data$city_quarter, -2)
# Keep only Paris listings whose two-digit quarter code is one of "01" to "20"
New_data <- subset(New_data, city == "Paris" & city_quarter %in% sprintf("%02d", 1:20))
unique_values <- unique(New_data$city_quarter)
# Printing unique values of the city quarters (arrondissements)
print(unique_values)
## [1] "17" "08" "18" "13" "16" "09" "10" "07" "15" "06" "19" "01" "20" "11" "04"
## [16] "02" "03" "12" "05" "14"
The subset of the New_data dataset comprises records corresponding to properties located in Paris. Specifically, it includes entries where the city quarter (or arrondissement) information is available and falls within the range of 01 to 20, excluding ‘00’. This filtering ensures that only relevant data related to properties situated in Paris and categorized within valid city quarters (Arrondissements) is retained for further analysis.
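To illustrate why the filter combines the city name with the quarter code, the sketch below applies the "last two digits" rule to a few made-up zip codes. A suburban code such as 93120 ends in "20" and would pass a suffix-only check, so a second condition is needed. The regular expression used here is an assumed alternative to the report's city == 'Paris' test, not the author's method.

```r
# Hypothetical zip codes: three Parisian, one blank, two suburban, one "00".
zipcodes <- c("75017", "75116", "75001", "", "92100", "93120", "75000")

# Last two characters of each code ("" stays "").
last_two <- substr(zipcodes, nchar(zipcodes) - 1, nchar(zipcodes))

# Valid arrondissement codes "01" to "20".
valid_codes <- sprintf("%02d", 1:20)

# Suffix-only check: accepts 93120 because it ends in "20".
by_suffix <- last_two %in% valid_codes

# Stricter check: the code must also look like a Paris zip code (75xxx).
by_full <- grepl("^75[01][0-9]{2}$", zipcodes) & by_suffix

print(data.frame(zipcodes, last_two, by_suffix, by_full))
```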
The data is now cleaned. Let’s look at the first rows of the new dataset, followed by its summary.
head(New_data)
## listing_id Host_id Host_name bathrooms bedrooms beds bed_type
## 1 4867396 9703910 Matthieu 1 1 1 Real Bed
## 2 7704653 35777602 Claire 2 2 3 Real Bed
## 3 2725029 13945253 Vincent 1 1 1 Real Bed
## 4 9337509 5107123 Julie 1 1 1 Real Bed
## 5 12928158 51195601 Daniele 1 1 1 Real Bed
## 6 5589471 28980052 Philippe 3 4 4 Real Bed
## Equipments
## 1 {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Dryer,Essentials}
## 2 {"Wireless Internet",Kitchen,"Elevator in Building","Buzzer/Wireless Intercom",Washer,Dryer,Essentials}
## 3 {TV,Internet,"Wireless Internet",Kitchen,"Indoor Fireplace",Heating,"Family/Kid Friendly",Washer,Essentials,Shampoo}
## 4 {"Wireless Internet",Kitchen,Heating,Washer,Essentials}
## 5 {"Wireless Internet",Kitchen,"Smoking Allowed","Pets Allowed",Breakfast,"Elevator in Building",Heating,"Family/Kid Friendly",Washer,Dryer,Essentials,Shampoo}
## 6 {TV,Internet,"Wireless Internet",Kitchen,Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Fire Extinguisher",Essentials}
## Property_type Room_type Nb_of_guests Price guests_included
## 1 Apartment Entire home/apt 2 60 1
## 2 Apartment Entire home/apt 4 200 1
## 3 Apartment Entire home/apt 2 80 1
## 4 Apartment Entire home/apt 2 60 0
## 5 Apartment Private room 2 50 1
## 6 House Entire home/apt 6 191 1
## minimum_nights maximum_nights availability_over_one_year instant_bookable
## 1 1 1125 0 f
## 2 1 1125 0 f
## 3 3 1125 298 f
## 4 2 1125 364 f
## 5 1 30 89 f
## 6 3 1125 0 f
## cancellation_policy city
## 1 flexible Paris
## 2 flexible Paris
## 3 flexible Paris
## 4 flexible Paris
## 5 flexible Paris
## 6 flexible Paris
## Adresse Neighbourhood
## 1 Rue Legendre, Paris, Île-de-France 75017, France Batignolles-Monceau
## 2 Avenue Mac-Mahon, Paris, Île-de-France 75017, France Batignolles-Monceau
## 3 Rue la Condamine, Paris, Île-de-France 75017, France Batignolles-Monceau
## 4 Rue Gauthey, Paris, Île-de-France 75017, France Batignolles-Monceau
## 5 Avenue Brunetière, Paris, Île-de-France 75017, France Batignolles-Monceau
## 6 Rue de Saussure, Paris, Île-de-France 75017, France Batignolles-Monceau
## city_quarter latitude longitude security_deposit transit host_response_time
## 1 17 48.88880 2.320466 94 N/A
## 2 17 48.87664 2.293724 1 N/A
## 3 17 48.88384 2.321031 208 within an hour
## 4 17 48.89236 2.322338 106 within a day
## 5 17 48.88942 2.298321 1 within an hour
## 6 17 48.88707 2.312212 1 N/A
## Superhost Host_since Listing_count Host_score reviews_per_month
## 1 f 2013-10-29 1 100 0.07
## 2 f 2015-06-14 1 NA NA
## 3 f 2014-04-06 1 80 0.11
## 4 f 2013-02-16 1 80 0.15
## 5 f 2015-12-13 1 100 2.00
## 6 f 2015-03-08 1 NA NA
## number_of_reviews square_feet
## 1 1 NA
## 2 0 NA
## 3 1 NA
## 4 1 NA
## 5 2 NA
## 6 0 NA
summary(New_data)
## listing_id Host_id Host_name bathrooms
## Min. : 2623 Min. : 2626 Marie : 564 Min. :0.000
## 1st Qu.: 3436213 1st Qu.: 6088109 Nicolas : 427 1st Qu.:1.000
## Median : 6920193 Median :15713334 Pierre : 408 Median :1.000
## Mean : 7007274 Mean :22236938 Caroline: 380 Mean :1.089
## 3rd Qu.:10563073 3rd Qu.:33957264 Anne : 377 3rd Qu.:1.000
## Max. :13819560 Max. :81397049 Sophie : 365 Max. :8.000
## (Other) :48792
## bedrooms beds bed_type
## Min. : 0.000 Min. : 0.000 Airbed : 27
## 1st Qu.: 1.000 1st Qu.: 1.000 Couch : 1159
## Median : 1.000 Median : 1.000 Futon : 433
## Mean : 1.057 Mean : 1.682 Pull-out Sofa: 4923
## 3rd Qu.: 1.000 3rd Qu.: 2.000 Real Bed :44771
## Max. :10.000 Max. :16.000
##
## Equipments
## {} : 532
## {TV,Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials} : 93
## {Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials} : 90
## {Internet,"Wireless Internet",Kitchen,Heating,Essentials} : 67
## {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Essentials}: 64
## {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer} : 64
## (Other) :50403
## Property_type Room_type Nb_of_guests
## Apartment :49355 Entire home/apt:44083 Min. : 1.000
## Loft : 549 Private room : 6745 1st Qu.: 2.000
## House : 508 Shared room : 485 Median : 2.000
## Bed & Breakfast: 375 Mean : 3.052
## Condominium : 255 3rd Qu.: 4.000
## Other : 117 Max. :16.000
## (Other) : 154
## Price guests_included minimum_nights maximum_nights
## Min. : 20.00 Min. : 0.000 Min. : 1.000 Min. :1.000e+00
## 1st Qu.: 55.00 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.:6.000e+01
## Median : 75.00 Median : 1.000 Median : 2.000 Median :1.125e+03
## Mean : 96.14 Mean : 1.356 Mean : 3.131 Mean :1.287e+05
## 3rd Qu.: 111.00 3rd Qu.: 2.000 3rd Qu.: 3.000 3rd Qu.:1.125e+03
## Max. :1285.00 Max. :16.000 Max. :1000.000 Max. :2.147e+09
##
## availability_over_one_year instant_bookable cancellation_policy
## Min. : 0 f:43069 flexible :18526
## 1st Qu.: 22 t: 8244 moderate :14720
## Median :183 strict :18057
## Mean :180 super_strict_30: 5
## 3rd Qu.:336 super_strict_60: 5
## Max. :365
##
## city
## Length:51313
## Class :character
## Mode :character
##
##
##
##
## Adresse
## Boulevard Voltaire, Paris, Île-de-France 75011, France : 209
## Rue du Faubourg Saint-Martin, Paris, Île-de-France 75010, France: 202
## Rue Oberkampf, Paris, Île-de-France 75011, France : 201
## Rue Saint-Maur, Paris, Île-de-France 75011, France : 196
## Rue de Charenton, Paris, Île-de-France 75012, France : 188
## Rue du Faubourg Saint-Denis, Paris, Île-de-France 75010, France : 174
## (Other) :50143
## Neighbourhood city_quarter latitude longitude
## Length:51313 Length:51313 Min. :48.82 Min. :2.230
## Class :character Class :character 1st Qu.:48.85 1st Qu.:2.323
## Mode :character Mode :character Median :48.86 Median :2.347
## Mean :48.86 Mean :2.344
## 3rd Qu.:48.88 3rd Qu.:2.369
## Max. :48.90 Max. :2.459
##
## security_deposit
## Min. : 1.00
## 1st Qu.: 1.00
## Median : 58.00
## Mean : 81.72
## 3rd Qu.:129.00
## Max. :304.00
##
## transit
## :17900
## Public transportation is a bit of a maze in Paris. I recommend you to book a transfer on the app Bonjour Paris (G00gle or Apple store). : 16
## DIRECT ACCESS From Airport CDG (Charles de Gaule-Roissy) DIRECT ACCESS From Airport ORLY EASY & FAST ACCESS from TRAIN STATIONS METRO Station Saint Michel line 4 is 3 minutes by foot from my place RER Station Saint Michel line B is 3 minutes by foot from my place TAXI STATION is 3 minutes by foot from my place By CAR : 2 choices of PARKING both 5 minutes by foot from my place : “Parking Saint Michel” Rue Francisque Gay n°46 and “Parking Notre Dame” Place Jean Paul II: 12
## Subway: Châtelet (lines 1, 4, 7, 11 & 14, RER A, B & D) : 12
## Odéon station line 4 and 10 Saint Michel station line 4, RER B and RER C : 10
## (Other) :33362
## NA's : 1
## host_response_time Superhost Host_since Listing_count
## : 44 : 44 Min. :2008-08-30 Min. : 1.000
## a few days or more: 973 f:49124 1st Qu.:2013-04-27 1st Qu.: 1.000
## N/A :12118 t: 2145 Median :2014-05-23 Median : 1.000
## within a day : 9969 Mean :2014-04-03 Mean : 4.129
## within a few hours:13596 3rd Qu.:2015-05-27 3rd Qu.: 1.000
## within an hour :14613 Max. :2016-07-03 Max. :155.000
## NA's :44
## Host_score reviews_per_month number_of_reviews square_feet
## Min. : 20.00 Min. : 0.010 Min. : 0.00 Min. : 0.0
## 1st Qu.: 87.00 1st Qu.: 0.360 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 93.00 Median : 0.900 Median : 3.00 Median : 323.0
## Mean : 91.02 Mean : 1.335 Mean : 12.78 Mean : 368.4
## 3rd Qu.: 97.00 3rd Qu.: 1.860 3rd Qu.: 13.00 3rd Qu.: 538.0
## Max. :100.00 Max. :14.290 Max. :392.00 Max. :15059.0
## NA's :14724 NA's :13826 NA's :48833
As a customer, the primary consideration when renting a place is the price. The variability in prices is inherently influenced by the type of property and room being rented. For instance, a shared room in a dormitory may have a different price range compared to a shared room in a large villa. Similarly, renting a full apartment is expected to be more expensive than renting a single room. To gain a deeper understanding of pricing dynamics in Paris, we first investigate the Airbnb dataset from this perspective. Our aim is to decipher the key factors influencing property prices and, specifically, identify the features that most significantly impact apartment prices offered by Airbnb hosts in Paris.
The Parisian offering on Airbnb is predominantly composed of entire apartments available for rent
To streamline our analysis, we reduce the dataset to the most relevant features. We assume that amenities such as an equipped kitchen, television, Wi-Fi or internet, and a sofa are common to most rented properties, so we focus instead on the key attributes that customers typically consider when renting a place: the type of room and property, the number of bedrooms, beds, and bathrooms, the neighbourhood, and the number of guests.
features_and_price <- New_data %>%
select(Property_type,
Room_type,
bathrooms,
bedrooms,
beds,
Neighbourhood,
Nb_of_guests,
Price)
View(features_and_price)
Correlation between Price and Apartment Features
# Plot the correlation matrix
cor_featuer_and_price <- features_and_price[, sapply(features_and_price, is.numeric)]
cor_featuer_and_price <- cor_featuer_and_price[complete.cases(cor_featuer_and_price), ]
correlation_matrix <- cor(cor_featuer_and_price, method = "spearman")
corrplot(correlation_matrix, method = "color", main = "")
The target variable Price is positively correlated with bathrooms, beds, bedrooms, and the number of guests. We therefore analyze the relationship between the price and some of these variables, starting with the distribution of Price itself.
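To read such correlations as numbers rather than colours, the Spearman coefficient of each feature with Price can be computed and ranked. The following is a minimal sketch on synthetic values: the data frame `toy` is made up for illustration, so the coefficients it prints say nothing about the Paris data.

```r
# Synthetic listings (made-up values, for illustration only)
toy <- data.frame(
  Price        = c(40, 55, 60, 75, 90, 120, 150, 200),
  bedrooms     = c(0, 1, 1, 1, 2, 2, 3, 4),
  bathrooms    = c(1, 1, 1, 1, 1, 2, 2, 3),
  Nb_of_guests = c(1, 2, 2, 3, 4, 4, 6, 8)
)

# Spearman correlation of every feature with Price
features <- setdiff(names(toy), "Price")
price_cor <- sapply(features, function(f) cor(toy[[f]], toy$Price, method = "spearman"))

# Rank features from most to least correlated with Price
price_cor <- sort(price_cor, decreasing = TRUE)
print(round(price_cor, 2))
```

Spearman's coefficient is rank-based, which makes it a reasonable choice for a skewed variable such as price.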
p1<- ggplot(features_and_price) +
geom_histogram(aes(Price), fill = "#971a4a", alpha = 0.85, binwidth = 15) +
theme_minimal(base_size = 13) +
xlab("Price") +
ylab("Frequency") +
ggtitle("Distribution of Price")
p2 <- ggplot(features_and_price, aes(Price)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "#971a4a") +
geom_density(alpha = 0.2, fill = "#971a4a") +
ggtitle("Logarithmic distribution of Price", subtitle = expression("With" ~'log'[10] ~ "transformation of x-axis")) +
scale_x_log10()
ggarrange(p1,
p2,
nrow = 1,
ncol=2,
labels = c("1. ", "2. "))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The logarithmic scale gives a clearer view of the price variable: the distribution is still not Gaussian, but it is less skewed than on the raw scale. Next, we investigate whether prices differ between the property types and room types offered in Paris by Airbnb hosts.
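The effect of the log transformation on skewness can be quantified. The sketch below is only an illustration: it draws synthetic log-normal prices (not the Airbnb data) and uses a hypothetical helper, `skewness`, that computes the moment-based sample skewness.

```r
# Synthetic right-skewed prices (log-normal draw; NOT the Airbnb data)
set.seed(42)
toy_price <- rlnorm(5000, meanlog = log(75), sdlog = 0.6)

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(toy_price)
skew_log <- skewness(log10(toy_price))
print(c(raw = skew_raw, log10 = skew_log))
```

A value near zero indicates a roughly symmetric distribution, so a much smaller skewness after `log10()` supports plotting on the logarithmic axis.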
Let’s analyze more closely the total count for each distinct room type.
# Count the total number of occurrences for each distinct room type
room_type_counts <- features_and_price %>%
group_by(Room_type) %>%
summarize(Total_Count = n())
# Print the total count for each distinct room type
print(room_type_counts)
## # A tibble: 3 × 2
## Room_type Total_Count
## <fct> <int>
## 1 Entire home/apt 44083
## 2 Private room 6745
## 3 Shared room 485
Now, let’s plot the distribution of room types in the dataset as a pie chart (a stacked bar drawn in polar coordinates), making it easier to compare the relative frequencies of the different room types.
room_types_counts <- table(features_and_price$Room_type)
room_types <- names(room_types_counts)
counts <- as.vector(room_types_counts)
percentages <- scales::percent(round(counts/sum(counts), 2))
room_types_percentages <- sprintf("%s (%s)", room_types, percentages)
room_types_counts_df <- data.frame(group = room_types, value = counts)
res2 <- ggplot(room_types_counts_df, aes(x = "", y = value, fill = room_types_percentages)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
scale_fill_brewer("Room Types", palette = "BuPu") +
ggtitle("Distribution of Room types") +
theme(plot.title = element_text(color = "black", size = 12, hjust = 0.5)) +
ylab("") +
xlab("") +
labs(fill="") +
theme(axis.ticks = element_blank(), panel.grid = element_blank(), axis.text = element_blank()) +
geom_text(aes(label = percentages), size = 5, position = position_stack(vjust = 0.5))
res2
From the plot we can see that entire homes/apartments dominate the offering, making up 86% of the listings, followed by private rooms (13%) and shared rooms (1%).
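These percentages follow directly from the room-type counts printed earlier. As a quick check in base R, using those three counts:

```r
# Room-type counts as printed in the table above
room_counts <- c("Entire home/apt" = 44083, "Private room" = 6745, "Shared room" = 485)

# Share of each room type, in percent
room_share <- 100 * prop.table(room_counts)
print(round(room_share, 1))
```

Rounded to whole numbers, the shares give the 86%, 13%, and 1% shown on the chart.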
Distribution of the price for each room type
ggplot(features_and_price) +
geom_boxplot(aes(x = Room_type,y = Price,fill = Room_type)) +
labs(x = "Room Type",y = "Price",fill = "Room Type") +
coord_flip()
The price increases in this order: shared room < private room < entire home/apt. Let’s have a look at the average price by room type.
Average price by Room type
features_and_price %>%
group_by(Room_type) %>%
summarise(mean_price = mean(Price, na.rm = TRUE)) %>%
ggplot(aes(x = reorder(Room_type, mean_price), y = mean_price)) +
# geom_col() plots the values as given, so it takes no `stat` argument
geom_col(fill = "#56478b") +
coord_flip() +
theme_minimal() +
geom_text(aes(label = round(mean_price, digits = 2)), hjust = 1.0, color = "white", size = 4.5) +
ggtitle("Average Price by Room Type") +
xlab("Room Type") +
ylab("Average Price")
Distribution of Listings Under $1,000 by room type
ggplot(features_and_price, aes(x = Price, fill = Room_type)) +
geom_histogram(position = "dodge") +
scale_fill_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), name = "Room Type") +
labs(title = "Distribution of Listings Under $1,000 by Room type", x = "Price per night", y = "Number of listings") +
theme(plot.title=element_text(vjust=2),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This visualization shows the distribution of listing prices under $1,000 per night and the mix of room types within each price range. The majority of listings are priced between $50 and $250.
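The chart title refers to listings under $1,000, but the plotting code shown above passes the full data frame to `ggplot()`. If one wanted to restrict the data explicitly, a filter along these lines could be applied first. This is a sketch on made-up prices; `toy` and `share_50_250` are illustrative names.

```r
# Made-up nightly prices (illustration only)
toy <- data.frame(
  Price     = c(45, 80, 120, 250, 999, 1000, 1285),
  Room_type = c("Shared room", "Private room", "Entire home/apt", "Entire home/apt",
                "Entire home/apt", "Entire home/apt", "Entire home/apt")
)

# Keep only listings priced strictly under 1,000 per night
under_1000 <- subset(toy, Price < 1000)

# Share of the retained listings priced between 50 and 250 (inclusive)
share_50_250 <- mean(under_1000$Price >= 50 & under_1000$Price <= 250)
print(share_50_250)
```

The same `subset()` call on the real data frame would make the plotted range match the title.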
Analysis by Property type
Let’s analyze more closely the total count for each distinct property type.
# Count the total number of occurrences for each distinct property type
property_type_counts <- features_and_price %>%
group_by(Property_type) %>%
summarize(Total_Count = n())
# Print the total count for each distinct property type
print(property_type_counts)
## # A tibble: 19 × 2
## Property_type Total_Count
## <fct> <int>
## 1 "" 3
## 2 "Apartment" 49355
## 3 "Bed & Breakfast" 375
## 4 "Boat" 29
## 5 "Cabin" 1
## 6 "Camper/RV" 3
## 7 "Cave" 1
## 8 "Chalet" 1
## 9 "Condominium" 255
## 10 "Dorm" 26
## 11 "Earth House" 1
## 12 "House" 508
## 13 "Igloo" 1
## 14 "Loft" 549
## 15 "Other" 117
## 16 "Tipi" 1
## 17 "Townhouse" 77
## 18 "Treehouse" 1
## 19 "Villa" 9
We can see that Parisian hosts offer three types of rooms: Entire home/apt, Private room and Shared room. Property types are more diverse and include some surprising entries such as cabin, cave, chalet, earth house or igloo. There is also a property type ‘Other’, into which further unexpected offerings may have been grouped. Because ‘Other’ is too vague to support any conclusion, we exclude it from our analysis, along with the blank type and the types with three listings or fewer. Consequently, we keep only the following relevant and explicit property types: Apartment, Bed & Breakfast, Boat, Condominium, Dorm, House, Loft, Townhouse, Villa.
list_property_types <- c("Apartment",
"Bed & Breakfast",
"Boat",
"Condominium",
"Dorm",
"House",
"Loft",
"Townhouse",
"Villa")
features_and_price <- features_and_price %>%
filter(Property_type %in% list_property_types)
# Count the total number of occurrences for each distinct property type
property_type_counts <- features_and_price %>%
group_by(Property_type) %>%
summarize(Total_Count = n())
# Print the total count for each distinct property type
print(property_type_counts)
## # A tibble: 9 × 2
## Property_type Total_Count
## <fct> <int>
## 1 Apartment 49355
## 2 Bed & Breakfast 375
## 3 Boat 29
## 4 Condominium 255
## 5 Dorm 26
## 6 House 508
## 7 Loft 549
## 8 Townhouse 77
## 9 Villa 9
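The hard-coded list of property types could also be derived from the counts. The sketch below assumes a rule that is not stated in this report: keep every explicitly named type with at least `min_count` listings. The threshold of 9 is a hypothetical value chosen so that the rule reproduces the list above; the counts are those printed in the first table (the blank type is left out).

```r
# Property-type counts as printed in the first table above (blank type omitted)
type_counts <- c(
  "Apartment" = 49355, "Bed & Breakfast" = 375, "Boat" = 29, "Cabin" = 1,
  "Camper/RV" = 3, "Cave" = 1, "Chalet" = 1, "Condominium" = 255, "Dorm" = 26,
  "Earth House" = 1, "House" = 508, "Igloo" = 1, "Loft" = 549, "Other" = 117,
  "Tipi" = 1, "Townhouse" = 77, "Treehouse" = 1, "Villa" = 9
)

# Assumed rule: keep explicit types with at least `min_count` listings
min_count <- 9
keep <- type_counts >= min_count & names(type_counts) != "Other"
kept_types <- names(type_counts)[keep]
print(kept_types)

# Share of the retained listings that are apartments
apartment_share <- type_counts[["Apartment"]] / sum(type_counts[kept_types])
print(round(100 * apartment_share, 1))
```

A count-based rule makes the cut-off explicit and easy to change, at the cost of having to justify the threshold.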
Distribution by property
We begin by plotting the distribution of property types in the dataset as a pie chart (a stacked bar drawn in polar coordinates), making it easier to compare the relative frequencies of the different property types.
# Calculate percentages of property types
property_type_df <- features_and_price %>%
count(Property_type) %>%
mutate(Percentage = n / sum(n))
# Define custom colors for the pie chart (one distinct color per property type)
custom_colors <- c("#ffff69", "#33a02c", "#a6cee3", "#b2df8a", "#1f78b4", "#fb9a99", "#e31a1c", "#fdbf6f", "#ff7f00")
# Create the pie chart
pie_chart <- ggplot(property_type_df, aes(x = "", y = Percentage, fill = Property_type)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
scale_fill_manual("Property Types", values = custom_colors, labels = paste0(property_type_df$Property_type, ": ", scales::percent(property_type_df$Percentage))) +
labs(title = "Distribution of Property Types",
fill = "Property Types",
y = "Percentage") +
theme_void() +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
guides(fill = guide_legend(title = "Property Types", label.position = "right"))
# Display the pie chart
print(pie_chart)
We notice that apartments are by far the most common property type, accounting for over 96% of the listings.
Price distribution by property
ggplot(features_and_price) +
geom_boxplot(aes(x = Property_type, y = Price, fill = Property_type)) +
labs(x = "Property Type", y = "Price", fill = "Property Type", title = "Price Distribution by Property Type") +
coord_flip()
This visualization depicts the distribution of prices across the property types. Villa prices appear to be spread fairly evenly over the range from 250 to 1250. The other distributions look similar to one another, with notable distinctions for Townhouse, Loft, House, and Bed & Breakfast, which exhibit higher-than-average rental prices. However, all property types other than Apartment collectively make up only 4% of our dataset, so I’ve opted not to analyze them further.
I have chosen to define features as the following key attributes: beds, bathrooms, bedrooms, and the number of guests.
We will now explore their relationship with the price using the visualization provided below:
a1<- ggplot(data=features_and_price) +
geom_smooth(mapping = aes(x=Price,y=beds), method = 'gam', col='grey')
a2<- ggplot(data=features_and_price) +
geom_smooth(mapping = aes(x=Price,y=bedrooms), method = 'gam', col='blue')
a3<- ggplot(data=features_and_price) +
geom_smooth(mapping = aes(x=Price,y=bathrooms), method = 'gam', col='violet')
a4<- ggplot(data=features_and_price) +
geom_smooth(mapping = aes(x=Price,y=Nb_of_guests), method = 'gam', col='black')
ggarrange(a1, a2, a3, a4, nrow=2, ncol=2, align = "hv")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
Let’s now plot the four relationships together on a single chart.
# Overlay the four smoothed curves on one plot, coloured by feature
pfeatures <- ggplot(data = features_and_price) +
geom_smooth(mapping = aes(x = Price, y = beds, col = 'beds'), method = 'gam') +
geom_smooth(mapping = aes(x = Price, y = bedrooms, col = 'bedrooms'), method = 'gam') +
geom_smooth(mapping = aes(x = Price, y = bathrooms, col = 'bathrooms'), method = 'gam') +
geom_smooth(mapping = aes(x = Price, y = Nb_of_guests, col = 'Nb_of_guests'), method = 'gam') +
ggtitle("Price versus features") + labs(y = "Features", x = "Price")
# Convert ggplot object to plotly object
pfeatures_plotly <- ggplotly(pfeatures)
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
# Print the interactive plot
pfeatures_plotly
We can see that the price tends to go up as the number of beds, bedrooms, bathrooms, and guests increases.
Analyzing the relationship between Price and number of bathrooms
# Round half-bathrooms down so that each facet corresponds to a whole number
features_and_price$bathrooms <- floor(features_and_price$bathrooms)
bath_distr <- (ggplot(features_and_price,
aes(x = Price))
+ geom_histogram(bins = 15,
aes(y = ..density..),
fill = "#66CC99")
+ geom_density(lty = 2, color = "#fb8072")
+ labs(title = "Distribution of prices vs Bathroom numbers",
x = "Price",
y = "Density")
+ theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_text(size = 7))
+ facet_wrap(~ factor(bathrooms),
scales = "free_y"))
bath_distr
This faceted histogram shows how prices are distributed for each number of bathrooms, providing insight into the relationship between these two variables.
# Keep listings with at most 6 bathrooms and plot that subset
apt_features_and_price_bath <- features_and_price %>%
filter(bathrooms <= 6)
ggplot(data = apt_features_and_price_bath, aes(x = bathrooms, y = Price, color=bathrooms)) +
geom_jitter(width = 0.1,height = 0.2,size=0.1)
For apartments with 0 bathrooms, prices are noticeably low. The majority of listed apartments have 1, 2, or 3 bathrooms. Properties with 1 or 2 bathrooms share a similar price distribution, normally under $500. Beyond that, there is hardly any relation between the number of bathrooms and price, except for apartments with 3 bathrooms, whose prices are spread fairly evenly between $50 and $1000.
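Observations like these can be backed by a small summary table giving the number of listings and the median price per bathroom count. The following sketch uses invented values (`toy`) purely to show the mechanics in base R.

```r
# Made-up listings (illustration only)
toy <- data.frame(
  bathrooms = c(0, 0, 1, 1, 1, 1.5, 2, 2, 3),
  Price     = c(30, 40, 60, 75, 90, 80, 150, 210, 400)
)

# Round half-bathrooms down, as in the analysis above
toy$bathrooms <- floor(toy$bathrooms)

# Number of listings and median price for each bathroom count
median_price <- tapply(toy$Price, toy$bathrooms, median)
n_listings   <- tapply(toy$Price, toy$bathrooms, length)

summary_tbl <- data.frame(
  bathrooms    = as.integer(names(median_price)),
  n            = as.vector(n_listings),
  median_price = as.vector(median_price)
)
print(summary_tbl)
```

Reporting the group size next to the median helps avoid over-interpreting groups that contain only a handful of listings.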
Analyzing the relationship between Price and Number of Beds
beds_distr <- (ggplot(features_and_price,
aes(x = Price))
+ geom_histogram(bins = 15,
aes(y = ..density..),
fill = "#66CC99")
+ geom_density(lty = 2,
color = "#fb8072")
+ labs(title = "Distribution of prices vs Beds numbers",
x = "Price",
y = "")
+ theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_text(size = 7))
+ facet_wrap(~ factor(beds),
scales = "free_y"))
beds_distr
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
beds_box <- (ggplot(features_and_price)
+ geom_boxplot(aes(x = factor(round(beds)),
y = Price,
fill = factor(beds)))
+ labs(x = "# of Beds",
y = "Price",
fill = "# of Beds")
+ coord_flip())
bed_scatt <- (ggplot(data = features_and_price, aes(x = beds, y = Price, color=beds)) +
geom_jitter(width = 0.1,height = 0.2,size=0.1))
ggarrange(beds_box,
bed_scatt,
nrow = 2,
ncol = 1,
labels = c("A", "B"))
We can observe that most listings have 1 to 6 beds and that there is no strong relationship between price and the number of beds. Apartments with a low number of beds tend to be in the same price range as those with 5 or 6 beds, probably because of other features.
Analyzing the relationship between Price and Number of Bedrooms
bedroom_distr <- (ggplot(features_and_price,
aes(x = Price))
+ geom_histogram(bins = 15,
aes(y = ..density..),
fill = "#66CC99")
+ geom_density(lty = 2,
color = "#fb8072")
+ labs(title = "Distribution of prices vs Bedrooms numbers",
x = "Price",
y = "")
+ theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_text(size = 7))
+ facet_wrap(~ factor(bedrooms),
scales = "free_y"))
bedroom_distr
## Warning: Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
bedroom_box <- (ggplot(features_and_price)
+ geom_boxplot(aes(x = factor(round(bedrooms)),
y = Price,
fill = factor(bedrooms)))
+ labs(x = "# of Bedroom",
y = "Price",
fill = "# of Bedroom")
+ coord_flip())
bed_scatt <- (ggplot(data = features_and_price, aes(x = beds, y = Price, color=bedrooms)) +
geom_jitter(width = 0.1,height = 0.2,size=0.1))
ggarrange(bedroom_box,
bed_scatt,
nrow = 2,
ncol = 1,
labels = c("A", "B"))
The higher the number of beds (and hence the number of guests accommodated), the higher the price, but more beds do not necessarily imply more bedrooms and bathrooms. Listings with 2 to 3 guests, 1 bedroom, and 1 bathroom probably correspond to private or shared rooms, which are cheaper.
For listings with more than 2 bathrooms, the number of beds and bedrooms tends to reach a maximum value even as the number of guests and the price keep increasing.
Altogether, the data suggests that the number of bathrooms is not the most reliable factor for anticipating the price of an apartment on Airbnb. The number of beds and the number of guests seem to be more informative in this regard: prices clearly increase along with these two variables.
Cancellation policy and host response time
price_cancellation_policy <- ggplot(data = New_data,
aes(x = cancellation_policy, y = Price, color=cancellation_policy)) +
geom_boxplot(outlier.shape = NA) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(plot.title = element_text(color = "#971a4a", size = 12, face = "bold", hjust = 0.5))+
coord_cartesian(ylim = c(0, 1300))
host_data_without_null_host_response_time <- subset(New_data, host_response_time != "N/A" & host_response_time != "")
price_response_time <- ggplot(data = host_data_without_null_host_response_time,
aes(x = host_response_time, y = Price, color = host_response_time)) +
geom_boxplot(outlier.shape = NA) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(plot.title = element_text(color = "#971a4a", size = 12, face = "bold", hjust = 0.5)) +
coord_cartesian(ylim = c(0, 500))
ggarrange(price_response_time,
price_cancellation_policy,
nrow = 1,
ncol = 2,
labels = c("1. ", "2. "))
In the first graph, which plots price against host response time, no discernible correlation is evident. In the second graph, however, cancellation policy has a visible influence on price: price levels differ from one cancellation policy to another.
Immediate reservation
ggplot(data = New_data, aes(x = instant_bookable, y = Price, color = instant_bookable)) +
geom_boxplot(outlier.shape = NA) +coord_cartesian(ylim = c(0, 500))
When looking at how price relates to whether a listing is instantly bookable, there doesn’t seem to be a clear connection. Instantly bookable listings (represented by t) show the same price variation as listings that require the host’s acceptance (represented by f).
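A visual impression of this kind can be complemented by a two-sample test. The report itself does not run one; the sketch below only shows how a Wilcoxon rank-sum test could be applied, using made-up prices in a toy data frame.

```r
# Made-up prices for listings that are not ("f") and are ("t") instantly bookable
toy <- data.frame(
  instant_bookable = factor(rep(c("f", "t"), each = 5)),
  Price            = c(50, 60, 70, 80, 90, 55, 65, 75, 85, 95)
)

# Median price per group
print(tapply(toy$Price, toy$instant_bookable, median))

# Wilcoxon rank-sum test: a large p-value gives no evidence of a price difference
test <- wilcox.test(Price ~ instant_bookable, data = toy)
print(test$p.value)
```

The rank-based test is used here because prices are skewed; with the real data, ties between prices would make R fall back to a normal approximation.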
Analyzing availability of apartment by price
ggplot(features_and_price, aes(x = Price)) +
geom_histogram(binwidth = 100, fill = "skyblue", color = "black") + # Adjust binwidth as needed
labs(title = "Apartment Availability by Price",
x = "Price",
y = "Number of Apartments") +
theme_minimal()
This histogram counts listings per price bin; it does not use the availability variable, so it cannot by itself show a relation between availability and price. It does show that listings are concentrated at the lower end of the price range: according to the summary above, roughly half of all listings are priced between $55 and $111. This is consistent with our previous analysis of the average price per room type.
Let’s plot price against the number of days each listing is available over a year to see whether availability and price are related.
ggplot( New_data, aes(availability_over_one_year, Price)) +
geom_point(alpha = 0.2, color = "#971a4a") +
geom_density(stat = "identity", alpha = 0.2) +
xlab("Availability over a year") +
ylab("Price") +
ggtitle("Relationship between availability over a year and price")
The plot above shows no clear relation between the availability of apartments over a year and their prices. Prices vary widely at every availability level and probably depend on other factors such as location and surroundings.
Let’s look at availability with a different plot to get a clearer picture of its distribution over a year.
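One way to check for a relation that a dense scatter plot may hide is to group listings into availability bands and compare the median price per band. The sketch below uses invented values, and the band edges (30 and 180 days) are an arbitrary choice made for illustration.

```r
# Made-up listings (illustration only)
toy <- data.frame(
  availability_over_one_year = c(0, 10, 45, 120, 200, 300, 364, 365),
  Price                      = c(60, 80, 75, 90, 70, 110, 65, 85)
)

# Group listings into availability bands (band edges are an arbitrary choice)
toy$band <- cut(toy$availability_over_one_year,
                breaks = c(0, 30, 180, 365),
                labels = c("0-30 days", "31-180 days", "181-365 days"),
                include.lowest = TRUE)

# Median price within each band
median_by_band <- tapply(toy$Price, toy$band, median)
print(median_by_band)
```

Similar medians across bands would support the reading that availability and price are unrelated.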
hchart(New_data$availability_over_one_year, color = "#336666", name = "Availability") %>%
hc_title(text = "Availability of listings") %>%
hc_add_theme(hc_theme_ffx())
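The graph shows how many listings are available for a given number of days per year. Because availability_over_one_year counts days (from 0 to 365) rather than calendar dates, it cannot tell us in which months listings are offered. According to the summary above (first quartile 22, median 183, third quartile 336), a quarter of the listings are available for 22 days or fewer and another quarter for 336 days or more.
Now that our analysis of price versus apartment features is finished, let’s explore the listings by hosts and Superhosts.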
# Count the number of apartments for each distinct host ID and include the host name
apartments_per_host <- New_data %>%
group_by(Host_id, Host_name) %>%
summarize(Num_Apartments = n_distinct(listing_id))
## `summarise()` has grouped output by 'Host_id'. You can override using the
## `.groups` argument.
# Print the number of apartments for each distinct host ID along with the host name
print(apartments_per_host)
## # A tibble: 43,651 × 3
## # Groups: Host_id [43,651]
## Host_id Host_name Num_Apartments
## <int> <fct> <int>
## 1 2626 Franck 2
## 2 2883 Shayne 2
## 3 3631 Anne 1
## 4 4175 Martin 1
## 5 6792 Jennifer Of Cobblestone Paris Rentals 24
## 6 7749 Jules 1
## 7 7903 Borzou 1
## 8 9011 Claire 2
## 9 9845 Marion 1
## 10 12764 Dorian 1
## # ℹ 43,641 more rows
Top 20 ‘Number of listings’ by owners
listings_per_host <- New_data %>%
group_by(Host_id, Host_name) %>%
summarize(Num_Listings = n_distinct(listing_id)) %>%
arrange(desc(Num_Listings))
## `summarise()` has grouped output by 'Host_id'. You can override using the
## `.groups` argument.
# Select the top 20 hosts with the highest number of listings
top_20_listings <- head(listings_per_host, 20)
# Print the table of the top 20 hosts with the highest number of listings
print(top_20_listings)
## # A tibble: 20 × 3
## # Groups: Host_id [20]
## Host_id Host_name Num_Listings
## <int> <fct> <int>
## 1 2288803 Fabien 154
## 2 2667370 Parisian Home 138
## 3 12984381 Olivier 89
## 4 3972699 Hanane 78
## 5 3943828 Caroline 65
## 6 21630783 Pierre 65
## 7 39922748 Clara 63
## 8 789620 Charlotte 60
## 9 11593703 Rudy And Benjamin 56
## 10 152242 Delphine 53
## 11 3971743 Diane 53
## 12 7612270 Paul 53
## 13 5027164 International Home Owners 52
## 14 13013633 Benjamin 52
## 15 67879895 Guillaume 52
## 16 23025598 My Apartment In Paris 47
## 17 5056483 Bettina 43
## 18 1322370 Nicolas 42
## 19 2503671 SmartFlux 40
## 20 2107478 Philippe 39
Plotting the distribution of hosts by their number of listings will give us some insights. We begin by grouping the hosts into three size categories, which keeps the chart readable.
count_by_host_1 <- New_data %>%
group_by(Host_id) %>%
summarise(number_apt_by_host = n()) %>%
ungroup() %>%
mutate(groups = case_when(
number_apt_by_host == 1 ~ "001",
between(number_apt_by_host, 2, 50) ~ "002-050",
number_apt_by_host > 50 ~ "051-153"))
count_by_host_2 <- count_by_host_1 %>%
group_by(groups) %>%
summarise(counting = n())
# Sort the count_by_host_2 data frame by the 'counting' column in descending order
count_by_host_2 <- count_by_host_2[order(-count_by_host_2$counting), ]
# Create bar chart for number of apartments per host
bar_num_apt_by_host <- ggplot(count_by_host_2, aes(x = groups, y = counting , fill = factor(groups))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = counting), vjust = ifelse(count_by_host_2$groups == "001", 0.0, -0.3), size = 3) +
labs(title = "Number of Hosts per Host Group \n ",
x = "Host Group (listings per host)",
y = "Number of Hosts",
fill = "Group") +
theme_minimal()
# Create bar chart for contrast between hosts and superhosts
bar_contrast_superhost <- ggplot(New_data) +
geom_bar(aes(x = '', fill = Superhost)) +
labs(title = "Contrast between Hosts and Superhosts",
x = NULL,
y = "Count",
fill = "Superhost") +
theme_minimal()
# Arrange plots in a grid
grid.arrange(bar_num_apt_by_host, bar_contrast_superhost, nrow = 2)
In this dataset, most hosts have a single listing: this is the case for 40424 owners, against only 3212 who have between 2 and 50 listings and 15 who have between 51 and 153 listings. Superhosts are clearly a minority in this dataset.
Table of host groups by number of apartments
table_representation <- data.frame(
Host_Group = count_by_host_2$groups,
Number_of_Apartments = count_by_host_2$counting
)
table_representation
## Host_Group Number_of_Apartments
## 1 001 40424
## 2 002-050 3212
## 3 051-153 15
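The grouping step above can also be written with base R's `cut()`, which assigns each count to an interval in a single call. This is only a minimal sketch on made-up counts: the vector `listings_per_host_toy` and the open-ended label `"051+"` are assumptions for illustration, not part of the original analysis.

```r
# Toy data: hypothetical number of listings per host (not taken from AirBnB.RData)
listings_per_host_toy <- c(1, 1, 1, 2, 7, 50, 51, 153)

# cut() maps each count to an interval; with right = TRUE the intervals are (a, b],
# so the bins are (0, 1], (1, 50] and (50, Inf)
host_groups <- cut(listings_per_host_toy,
                   breaks = c(0, 1, 50, Inf),
                   labels = c("001", "002-050", "051+"),
                   right = TRUE)

# Count how many hosts fall into each group
group_counts <- table(host_groups)
print(group_counts)
```

Leaving the last bin open-ended avoids hard-coding the largest observed count into the label, which matters if the data are refreshed.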
last_date <- max(New_data$Host_since,na.rm = TRUE)
last_date
## [1] "2016-07-03"
This is the most recent host registration date in the dataset, which shows that the data run up to 3 July 2016.
Number of hosts per year
new_hosts_data <- drop_na(New_data, c("Host_since"))
# Count new hosts per month (dates from 2017 onward are excluded; the data end in July 2016)
new_hosts_data$Host_since <- as.Date(new_hosts_data$Host_since, '%Y-%m-%d')
new_hosts_data <- new_hosts_data[new_hosts_data$Host_since < as.Date("2017-01-01"),]
new_hosts_data <- new_hosts_data[order(as.Date(new_hosts_data$Host_since, format="%Y-%m-%d")),]
new_hosts_data$Host_since <- format(as.Date(new_hosts_data$Host_since, "%Y-%m-%d"), format="%Y-%m")
new_hosts_data_table <- table(new_hosts_data$Host_since)
# Plot
plot(as.Date(paste(format(names(new_hosts_data_table), format="%Y-%m"),"-01", sep="")), as.vector(new_hosts_data_table), type = "l", xlab = "Time", ylab = "Number of new hosts", col = "Blue")
The dataset only runs until early July 2016, so trends in new host numbers cannot be assessed beyond that point. From 2008 to 2015, the number of new hosts increased noticeably. After 2015 the curve shows a notable decline in the number of new hosts, although 2016 is only partially covered.
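Because the last year is incomplete, a fairer year-to-year comparison restricts every year to the same calendar window. The sketch below illustrates the idea on hypothetical registration dates; `host_since_toy` is invented for the example, and only the cutoff date comes from the report.

```r
# Hypothetical registration dates, for illustration only
host_since_toy <- as.Date(c("2014-02-10", "2014-09-01", "2015-01-15",
                            "2015-03-20", "2015-11-05", "2016-04-12",
                            "2016-06-30"))

# Last date covered by the data (the report finds 2016-07-03 for Host_since)
cutoff <- as.Date("2016-07-03")

# Encode month and day as an integer (e.g. 3 July -> 703) so that dates from
# different years can be compared on their position within the year
month_day <- as.integer(format(host_since_toy, "%m%d"))
in_window <- month_day <= as.integer(format(cutoff, "%m%d"))

# New hosts per year, counting only 1 January to 3 July of each year
comparable_counts <- table(format(host_since_toy[in_window], "%Y"))
print(comparable_counts)
```

With this restriction, a drop in the final year can no longer be explained by missing months alone.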
Number of listings by neighborhood
# Plot for number of listings by neighborhood
listings_neighb <- ggplot(New_data, aes(x = fct_infreq(Neighbourhood), fill = Room_type)) +
geom_bar() +
labs(title = "Number of Listings by Neighbourhood",
x = "Neighbourhood", y = "Number of Listings") +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 75, hjust = 1),
plot.title = element_text(color = "black", size = 12, hjust = 0.5))
# Plot the bar chart
listings_neighb
Average price per Neighbourhood
library(ggplot2)
# Calculate average daily price per city quarter
average_prices_per_arrond <- aggregate(cbind(New_data$Price),
by = list(arrond = New_data$city_quarter),
FUN = function(x) mean(x))
# Plot for average daily price per city quarter
price_arrond <- ggplot(data = average_prices_per_arrond, aes(x = arrond, y = V1)) +
geom_bar(stat = "identity", fill = "lightblue", width = 0.7) +
geom_text(aes(label = round(V1, 2)), size = 4) +
coord_flip() +
labs(title = "Average Daily Price per City Quarter",
x = "City Quarters", y = "Average Daily Price") +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(color = "black", size = 12, hjust = 0.5))
# Display the plot
print(price_arrond)
The most expensive districts are the 1st to the 8th and the 16th, with average prices ranging from around 100 to 159 dollars. This is probably because most of the monuments and tourist areas are either inside or near these districts.
Other districts have a mean price between 66 and 88 dollars. Most of the listings are located in these districts.
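Since the discussion relates average price to where most listings are located, it can help to compute both figures side by side. The following is a minimal base R sketch on invented values; the data frame `toy` and its prices are assumptions, and only the column names `city_quarter` and `Price` mirror those used in the report.

```r
# Toy listings: hypothetical quarters and prices, not taken from AirBnB.RData
toy <- data.frame(
  city_quarter = c("1", "1", "8", "8", "18", "18", "18"),
  Price = c(150, 170, 140, 160, 60, 70, 80)
)

# Mean price and number of listings per quarter
avg_price <- tapply(toy$Price, toy$city_quarter, mean)
n_listings <- tapply(toy$Price, toy$city_quarter, length)

# Combine into one table and sort from most to least expensive
summary_by_quarter <- data.frame(
  city_quarter = names(avg_price),
  avg_price = as.vector(avg_price),
  n_listings = as.vector(n_listings)
)
summary_by_quarter <- summary_by_quarter[order(-summary_by_quarter$avg_price), ]
print(summary_by_quarter)
```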
New_data %>%
group_by(Neighbourhood) %>%
dplyr::summarize(num_listings = n(), borough = unique(Neighbourhood)) %>%
top_n(n = 10, wt = num_listings) %>%
ggplot(aes(x = fct_reorder(Neighbourhood, num_listings), y = num_listings, fill = borough)) +
geom_col() +
coord_flip() +
labs(title = "Top 10 neighborhoods by nb. of listings", x = "Neighbourhood", y = "Nb. of listings")
table <- inner_join(New_data, R,by = "listing_id")
tab1 <- select(New_data,listing_id,city,city_quarter)
table = mutate(table,year = as.numeric(str_extract(table$date, "^\\d{4}")))
p6 <- ggplot(table) +
geom_bar(aes(y =city_quarter ,fill=factor(year)))+
scale_size_area() +
labs( x="Frequency", y="City quarter",fill="Year")+
scale_fill_brewer(palette ="Spectral")
ggplotly(p6)
The graph shows that the highest frequency was reached in 2015. Had the data for 2016 extended beyond July, we would probably have seen comparable figures for 2015 and 2016. We also observe that the frequency increases every year since Airbnb's inception, as the platform gained popularity worldwide.
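The counts behind a chart like this one come from extracting the year of each dated record and cross-tabulating it against the city quarter. The sketch below shows that step in base R on made-up rows; `reviews_toy` and its values are hypothetical, and only the column names `city_quarter` and `date` follow the joined table used above.

```r
# Hypothetical dated records per city quarter, for illustration only
reviews_toy <- data.frame(
  city_quarter = c("11", "11", "18", "18", "18", "11"),
  date = c("2014-05-01", "2015-06-12", "2015-07-30",
           "2015-08-02", "2016-01-15", "2016-03-09")
)

# Parse the date and keep the year as an integer
reviews_toy$year <- as.integer(format(as.Date(reviews_toy$date), "%Y"))

# Cross-tabulate: rows are city quarters, columns are years
visits <- table(reviews_toy$city_quarter, reviews_toy$year)
print(visits)
```

Parsing with `as.Date()` rather than a regular expression has the side benefit of failing loudly on malformed dates.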
Evolution of apartments over years
#Convert Date type from factor to date
table["date"] <- table["date"] %>% map(., as.Date)
# Aggregate the joined table by date and neighbourhood and count the rows
# to get the number of rentals per neighbourhood and date
longitudinal <- table %>%
group_by(date, Neighbourhood) %>%
summarise(count_obs = n())
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
time_location <- (ggplot(longitudinal,
aes(x = date,
y = count_obs,
group = 1))
+ geom_line(linewidth = 0.5,
colour = "lightblue")
+ stat_smooth(color = "darkblue",
method = "loess")
+ scale_x_date(date_labels = "%Y")
+ labs(x = "Year",
y = "No. of Rented Apartments")
+ facet_wrap(~ Neighbourhood))
time_location
## `geom_smooth()` using formula = 'y ~ x'
The evolution of apartments over the years shows a similar pattern for all neighborhoods: the numbers have grown everywhere, with particularly strong growth in Buttes-Montmartre and Popincourt.
Price range within Paris neighborhoods
# Filter data for Paris
paris_data <- New_data %>%
filter(city == "Paris" & !is.na(longitude) & !is.na(latitude) & longitude != "" & latitude != "")
# Calculate average price for each neighborhood
avg_price_per_neighborhood <- paris_data %>%
group_by(Neighbourhood) %>%
summarize(Avg_Price = mean(Price))
# Create the violin plot
violin_plot <- ggplot(paris_data, aes(x = Neighbourhood, y = Price, fill = Neighbourhood)) +
geom_violin() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "Price Range within Paris Neighborhoods", x = "Neighbourhood", y = "Price") +
scale_fill_manual(values = rainbow(length(unique(paris_data$Neighbourhood))),
guide = guide_legend(title = "Neighbourhood (Avg. Price)"),
breaks = avg_price_per_neighborhood$Neighbourhood,
labels = paste(avg_price_per_neighborhood$Neighbourhood, " (", round(avg_price_per_neighborhood$Avg_Price, 2), ")"))
# Print the violin plot
violin_plot
We can see that prices are higher around the center of Paris.
From the above plot, it is evident that certain districts, such as Elysée, Opera, and Palais-Bourbon, exhibit a higher concentration of properties. This observation aligns with the understanding that real estate prices tend to be notably higher in these districts compared to others.
This is an interactive map using Leaflet displaying the listings by neighborhood.
df <- select(L,longitude,neighbourhood,latitude,price)
leaflet(df %>% select(longitude,neighbourhood,
latitude,price))%>%
setView(lng = 2.3488, lat = 48.8534 ,zoom = 10) %>%
addTiles() %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addMiniMap()
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
This is an interactive Leaflet map displaying the listings owned by ‘Superhosts’ (2145 in total, around 4% of all listings).
# Keep only the listings whose host is a Superhost
dfsuperhost <- filter(New_data, Superhost == "t")
leaflet(dfsuperhost %>% select(longitude, Neighbourhood, latitude, Price)) %>%
setView(lng = 2.3488, lat = 48.8534 ,zoom = 10) %>%
addTiles() %>%
addMarkers(clusterOptions = markerClusterOptions()) %>%
addMiniMap()
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
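The Superhost count and share quoted above can be derived directly from the flag column. Here is a minimal sketch on a made-up vector; `superhost_toy` is hypothetical, and the `"t"`/`"f"` coding is assumed from the filter used in the report. Note that missing flags are excluded from the denominator in this version, which is a design choice rather than something the report specifies.

```r
# Hypothetical Superhost flags, coded "t"/"f" as in the filter above; NA = unknown
superhost_toy <- c("t", "f", "f", "f", NA, "f", "t", "f", "f", "f")

# Number of listings owned by Superhosts (NA values are ignored)
n_superhost <- sum(superhost_toy == "t", na.rm = TRUE)

# Share of Superhost listings among listings with a known flag
share_superhost <- mean(superhost_toy == "t", na.rm = TRUE)

cat("Superhost listings:", n_superhost, "\n")
cat("Share:", round(100 * share_superhost, 1), "%\n")
```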
The predominant type of Airbnb listings in Paris are entire homes or apartments. Pricing of these listings is influenced by factors such as the number of beds, bedrooms, bathrooms, and capacity to accommodate guests. The type of listing (entire home or shared space) also plays a significant role in determining price.
There is a correlation between listing price and location. Neighborhoods with better amenities and higher desirability tend to have fewer Airbnb listings, but these listings command higher prices. Districts like Buttes-Montmartre, Popincourt, and Vaugirard are popular areas, while renowned Parisian quarters like Elysée, Palais-Bourbon, Louvre, and Luxembourg exhibit higher listing prices due to historical significance and tourist appeal.
Only a minority of hosts achieve Superhost status on Airbnb. Superhosts are recognized for providing exceptional guest experiences, as evaluated by guest reviews and other criteria. The stringent evaluation process ensures that Superhosts maintain high standards of hospitality, enhancing trust and satisfaction among guests.
Our analysis underscores the intricate interplay between property attributes, location dynamics, and host reputation in shaping the Airbnb landscape in Paris. These insights provide valuable guidance for both hosts and guests navigating the vibrant short-term rental market in the city.
sessionInfo()
## R version 4.3.3 (2024-02-29 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 zoo_1.8-12 here_1.0.1 kableExtra_1.4.0
## [5] highcharter_0.9.4 corrplot_0.92 leaflet_2.2.2 plotly_4.10.4
## [9] writexl_1.5.0 ggpubr_0.6.0 ggmap_4.0.0 lubridate_1.9.3
## [13] forcats_1.0.0 purrr_1.0.2 readr_2.1.4 tibble_3.2.1
## [17] tidyverse_2.0.0 ggplot2_3.4.4 stringr_1.5.0 dplyr_1.1.3
## [21] shiny_1.8.1 tidyr_1.3.0 skimr_2.1.5 DataExplorer_0.8.3
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 rlang_1.1.1 magrittr_2.0.3 compiler_4.3.3
## [5] mgcv_1.9-1 png_0.1-8 systemfonts_1.0.5 vctrs_0.6.4
## [9] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.1 backports_1.4.1
## [13] labeling_0.4.3 utf8_1.2.4 promises_1.2.1 rmarkdown_2.25
## [17] tzdb_0.4.0 xfun_0.40 cachem_1.0.8 jsonlite_1.8.7
## [21] later_1.3.2 jpeg_0.1-10 broom_1.0.5 parallel_4.3.3
## [25] R6_2.5.1 bslib_0.5.1 stringi_1.7.12 RColorBrewer_1.1-3
## [29] rlist_0.4.6.2 car_3.1-2 jquerylib_0.1.4 Rcpp_1.0.11
## [33] assertthat_0.2.1 knitr_1.44 base64enc_0.1-3 Matrix_1.6-5
## [37] httpuv_1.6.15 splines_4.3.3 igraph_2.0.3 timechange_0.2.0
## [41] tidyselect_1.2.0 rstudioapi_0.15.0 abind_1.4-5 yaml_2.3.7
## [45] curl_5.1.0 lattice_0.22-5 plyr_1.8.9 quantmod_0.4.26
## [49] withr_2.5.1 evaluate_0.22 xts_0.13.2 xml2_1.3.5
## [53] pillar_1.9.0 carData_3.0-5 generics_0.1.3 TTR_0.24.4
## [57] rprojroot_2.0.4 hms_1.1.3 munsell_0.5.0 scales_1.2.1
## [61] xtable_1.8-4 glue_1.6.2 lazyeval_0.2.2 tools_4.3.3
## [65] data.table_1.14.8 ggsignif_0.6.4 cowplot_1.1.3 grid_4.3.3
## [69] crosstalk_1.2.1 colorspace_2.1-0 nlme_3.1-164 networkD3_0.4
## [73] repr_1.1.7 cli_3.6.1 fansi_1.0.5 viridisLite_0.4.2
## [77] svglite_2.1.3 gtable_0.3.4 rstatix_0.7.2 sass_0.4.7
## [81] digest_0.6.33 htmlwidgets_1.6.4 farver_2.1.1 htmltools_0.5.8.1
## [85] lifecycle_1.0.3 httr_1.4.7 mime_0.12